0.1 Introduction

Much of the core coursework for an MS in Statistics traditionally treats classical statistical methods and studies their properties in detail. In recent years, machine learning (ML) methods have emerged with great success in a wide array of applications, prompting statisticians to rethink many of the same statistical methods from a machine learning perspective, often blurring the boundary between statistics and ML. This MS project sets out to round out the MS curriculum with some popular statistical machine learning methods by treating them as black boxes instrumented with tunable knobs for various configurations. As the statistician soon realizes, all of these black boxes are subject to a common constraint known as the bias-variance tradeoff, one that governs the efficacy of many statistical methods with hyperparameters. The conceptual caricature in Figure 1, of a statistician fumbling with knobs protruding from a black box, epitomizes the process of priming a statistical method with the optimal hyperparameter(s). Through the explorations in this project, the hope is that the data analyst gains a deep appreciation of the bias-variance tradeoff (what it means and how it can be quantified) via simulation studies, and at the end of the day is ready to apply the very same principles to real data sets by appealing to its close cousin, cross-validation, tuning those methods “just right” for the application.


Figure 1: A typical day as a statistician

0.2 Background and Methods

We read the popular book An Introduction to Statistical Learning in its entirety both as an opportunity to review and reflect on some of the same statistical methods covered in the MS program from a statistical learning perspective and as a hands-on introduction to other mainstream statistical learning methods. The entire book serves as an inspiration for the simulation studies in this MS project. (N.B.: In what follows, unless otherwise specified, we use the term “test set” interchangeably with “validation set” as long as there is no confusion.)

0.2.1 Data Setting and the Bias-Variance Decomposition

In a general univariate setting where we observe a quantitative response \(Y\) and \(p\) predictors \(X = (X_1, X_2, \ldots, X_p)\), the central task in statistical learning is about estimating a fixed but unknown function \(f\) that maps \(X\) to \(Y\) in the following fashion:

\[Y = f(X) + \epsilon\]

where \(\epsilon\) is a random error term, independent of \(X\), with mean 0 and variance \(\sigma^2\).

If we denote our best estimate of the true function \(f\) as \(\hat f\), which can be found by countless algorithms used for supervised learning, it turns out that whichever function \(\hat {f}\) we select, we can decompose its expected error on an unseen sample \(X\) as follows:

\[\begin{equation} E_{D, \epsilon} [(Y - \hat f(X; D))^2] = (Bias_D [\hat f(X; D)])^2 + Var_D [\hat f(X; D)] + \sigma^2 \tag{1} \end{equation}\]

where

\[\begin{equation} Bias_D[\hat f(X; D)] = E_D[\hat f(X; D)] - f(X) \tag{2} \end{equation}\]

and

\[\begin{equation} Var_D [\hat f(X; D)] = E_D[(E_D [\hat f(X; D)] - \hat f(X; D))^2] \tag{3} \end{equation}\]

The expectation ranges over different choices of the training set \(D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}\), all sampled from the same joint distribution \(P(X, Y)\).

The \(\sigma^2\) component is referred to as the irreducible error since this part will be present even if we can estimate the true underlying function \(f\) perfectly (i.e., no bias and no variance). Thus \(\sigma^2\) is the lowest achievable predictive MSE and the goal of statistical learning is to get as close to it as possible.
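The decomposition in Equation (1) can be checked numerically. Below is a minimal Python sketch (the project's own simulations are written in R) using a hypothetical setting: a linear \(f\), a deliberately shrunken least-squares estimator so that the bias is visible, and a single test point. All names and constants are illustrative, not taken from the project.

```python
import random
import statistics

random.seed(0)

def simulate(n_sim=10000, n=20, sigma=1.0, x0=1.0):
    """Monte Carlo check of Equation (1) at a single test point x0."""
    f = lambda x: 2.0 * x
    preds, sq_errors = [], []
    for _ in range(n_sim):
        # fresh training set D
        xs = [random.uniform(0, 1) for _ in range(n)]
        ys = [f(x) + random.gauss(0, sigma) for x in xs]
        beta = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        pred = 0.9 * beta * x0          # shrunken on purpose: nonzero bias
        preds.append(pred)
        y_test = f(x0) + random.gauss(0, sigma)   # fresh noise at the test point
        sq_errors.append((y_test - pred) ** 2)
    mse = statistics.fmean(sq_errors)
    bias = statistics.fmean(preds) - f(x0)
    var = statistics.pvariance(preds)
    return mse, bias ** 2, var, sigma ** 2

mse, bias2, var, irr = simulate()
# Equation (1): mse should be close to bias2 + var + irr
```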

It is noteworthy that the estimator \(\hat f(X; D)\) can be found by a variety of statistical methods. Furthermore, for each family of statistical methods indexed by hyperparameters, the resulting \(\hat f\) depends on the chosen hyperparameters, which in turn yield different decompositions of the total expected mean squared error into a bias and a variance component. It is then natural to seek a set of hyperparameters that results in low bias and low variance simultaneously so as to improve the overall expected prediction error. Unfortunately, the two components generally trade off with one another: as the bias increases, the variance tends to decrease, and vice versa. It is, however, possible to find the middle ground at which the sum of the two contributions is the lowest possible for a specific data setting and a given family of statistical methods.

0.2.2 Method of Simulating the Bias and Variance Tradeoff

The bias-variance tradeoff is usually visualized as a plot of the total expected mean squared error (MSE), along with its three components (bias, variance, and the irreducible error), against one or more hyperparameters, which often represent the degree of flexibility/complexity of a statistical learning method. In general, we expect that as the flexibility increases, the bias will decrease and the variance will increase. This is because a more flexible method learns the training data harder by picking up more nuances, but at the same time it also learns by heart the random noise that accompanies the real signal. This results in predictions that can vary a lot (high variance) depending on the particular training data, but that on average do well (low bias).

To make a graph depicting the aforementioned relationship, it is practically infeasible to appeal to real data sets, for two reasons: 1) there is typically no way to determine the ground truth from a real data set; and 2) we need many independent data sets from the same data-generating mechanism to calculate the bias and the variance. Simulation studies, on the other hand, are promising, as they allow for customizing the true underlying function while affording an abundance of data. The choice of the functions \(f\) is somewhat arbitrary but can easily be modified to match any real data problem motivating the simulation studies. For the purposes of this project, it suffices to investigate a small number of functions of varying complexity and observe the resulting tradeoffs for some important families of statistical learning methods. Another design decision concerns the distribution from which the predictors \(X\) are drawn. Once again, this is best motivated by a real data problem, and the simulation settings can easily be adapted to it. For simplicity, we choose uniform distributions. Last but not least, we must also decide on the error distribution. The theoretical decomposition of the mean squared error into its components dictates that the errors be drawn from a distribution with zero mean and finite variance, but does not otherwise specify the shape of the distribution. It turns out that the particular distribution does not play a huge role in the bias-variance tradeoff. We experiment with a variety of error distributions for some statistical methods, but stick to the normal distribution for the rest.

Once the data setting is determined, we simply deploy the statistical methods on the data and observe how they perform on average, recording the result as the bias-variance tradeoff curve. The computations essentially emulate the expectations in Equations (1), (2), and (3) by averaging over many independent, identically distributed data sets (a technique known as Monte Carlo simulation). This is done for each set of hyperparameters to trace out a path (or high-dimensional surface) along which the expected MSE and its components (bias, variance, and irreducible error) travel. The task of the statistician then becomes simple: visualize the resulting bias-variance tradeoff, pick out the sweet spot at which the minimum of the expected MSE is attained, and get a sense of how low that minimum can be and at which hyperparameter values.

The actual implementation is procedural, and goes as follows (note that the phrasing “set of hyperparameters” is used to account for the possibility that a statistical learning method can be tuned by more than one hyperparameter):

  1. Generate Training Sets:

    • \(n_{sim}\) i.i.d. data sets each containing \(n_{sample}\) i.i.d. observations of the form \((X, Y)\), where

      • \(X \sim P_X(x)\),
      • \(Y = f(X) + \epsilon\), and
      • \(\epsilon \sim P_{\epsilon}(e)\)
      • \(P_X\) and \(P_{\epsilon}\) are distributions from which the predictors \(X\) and random noises \(\epsilon\) are drawn, respectively
      • \(X\) can be either univariate or multivariate
  2. Generate Test Sets:

    • Same as in step 1, except that we randomly generate \(n_{sample}\) observations of the predictors \(X\) only once (instead of drawing \(n_{sim}\) times), and replicate those same (randomly drawn) \(X\) values \(n_{sim}\) times. We generate the response \(Y\) values the same way as before (that is, the errors will be drawn independently \(n_{sim}\) times).
  3. Deploy the Statistical Method:

    • At this point we have \(n_{sim}\) i.i.d. training sets each containing \(n_{sample}\) observations of the form \((X, Y)\) and \(n_{sim}\) test sets each containing \(n_{sample}\) observations of the form \((X, Y)\) where the \(X\) values are random but shared across the \(n_{sim}\) sets of \(Y\) values. (We must have the same set of \(n_{sample}\) observations of \(X\) across the \(n_{sim}\) Monte Carlo iterations in order for the notions of bias and variance to be meaningful.)
    • Train the learning method configured at a given set of hyperparameters on each of the \(n_{sim}\) training sets and predict the responses on a corresponding test set. We should obtain a prediction matrix of dimensions \(n_{sim} \times n_{sample}\).
  4. Compute Bias, Variance and MSE according to Equations (2), (3), and (1) respectively from the prediction matrix (along with the true function \(f\) and the response values \(Y\) in the test sets) for that given set of hyperparameters.

    • At this point, the prediction matrix is reduced to a single number, one for each of the quantity of interest, at the given set of hyperparameters.
  5. Repeat steps 1-4 over a grid of (sets of) hyperparameters and visualize the resulting mean squared error profile and how it is decomposed into its components.
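The steps above can be sketched in code. This is a hedged Python illustration (the project's code is in R) with a hypothetical cubic \(f\) and k-nearest-neighbour regression standing in for a generic tunable method, with \(k\) as the single knob; none of these choices come from the report itself. The MSE is assembled from Equation (1) rather than from fresh test responses, which suffices for tracing the tradeoff curve.

```python
import random
import statistics

random.seed(1)
f = lambda x: x ** 3 - x          # hypothetical true function
sigma = 0.3
n_sim, n_sample = 200, 50

# Step 2: one fixed set of test X values, shared across all Monte Carlo runs
x_test = [random.uniform(-1.5, 1.5) for _ in range(n_sample)]

def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return statistics.fmean(ys[i] for i in nearest)

profile = {}
for k in (1, 5, 15, 40):                          # step 5: grid of hyperparameters
    preds = []                                     # n_sim x n_sample prediction matrix
    for _ in range(n_sim):
        # step 1: fresh i.i.d. training set
        xs = [random.uniform(-1.5, 1.5) for _ in range(n_sample)]
        ys = [f(x) + random.gauss(0, sigma) for x in xs]
        # step 3: train at this k and predict on the shared test X values
        preds.append([knn_predict(xs, ys, x0, k) for x0 in x_test])
    # step 4: reduce the matrix to bias^2, variance and MSE, averaged over x_test
    bias2 = statistics.fmean(
        (statistics.fmean(p[j] for p in preds) - f(x_test[j])) ** 2
        for j in range(n_sample))
    var = statistics.fmean(
        statistics.pvariance(p[j] for p in preds) for j in range(n_sample))
    profile[k] = (bias2, var, bias2 + var + sigma ** 2)
```

Small \(k\) (flexible) should show low bias and high variance; large \(k\) (rigid) the reverse.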

0.2.3 Method of Cross-Validation

The book (James et al. 2013) discusses how to carry out cross-validation in detail; here we reiterate some of the key points.

We generate one single set of data containing observations of the form \((X, Y)\) from the desired distribution. This approach involves randomly dividing the set of observations into \(K\) groups, or folds, of approximately equal size. The first fold is treated as a test set (or a validation set; this is where the name “cross-validation” comes from), and the method is fit on the remaining \(K - 1\) folds. The mean squared error, \(MSE_1\), is then computed on the observations in the held-out fold. This procedure is repeated \(K\) times; each time, a different group of observations is treated as the test set. The process results in \(K\) estimates of the test error, \(MSE_1\), \(MSE_2\), \(\ldots\), \(MSE_K\). The \(K\)-fold CV estimate is computed by averaging these values, \[\begin{equation} CV_{(K)} = \frac{1}{K}\sum_{i=1}^K MSE_i \tag{4} \end{equation}\]
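As a concrete illustration of Equation (4), here is a minimal Python sketch of \(K\)-fold CV (the project's code is in R). The `fit_mean` model and the data-generating constants are hypothetical stand-ins, not the methods studied in the report.

```python
import random
import statistics

random.seed(2)

def kfold_cv_mse(data, fit, K=10):
    """Equation (4): average the held-out MSE over K roughly equal folds."""
    data = data[:]                 # shuffle a copy so the folds are random
    random.shuffle(data)
    folds = [data[i::K] for i in range(K)]
    mses = []
    for i in range(K):
        held_out = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)         # fit on the K-1 remaining folds
        mses.append(statistics.fmean((y - model(x)) ** 2 for x, y in held_out))
    return statistics.fmean(mses)

# Usage with a toy constant-mean "model": predict the training mean everywhere
def fit_mean(train):
    mu = statistics.fmean(y for _, y in train)
    return lambda x: mu

data = [(x, 3.0 + random.gauss(0, 1)) for x in range(200)]
cv = kfold_cv_mse(data, fit_mean)  # should land near the noise variance of 1
```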

0.2.4 Relationship between the Bias-Variance Tradeoff and Cross-Validation

The ideal bias-variance tradeoff curve/surface confers key insights into how best to prime a given family of statistical learning methods so that it is well poised to understand unseen data. In real life, however, the statistician typically has access neither to an infinite pool of observations nor to the true underlying function. Does this inconvenient reality jeopardize the statistician’s mission of seeking the holy grail holding the best set of hyperparameters? A close inspection of Equations (1), (2), and (3) does suggest that the lack of abundant data deprives the variance term of its meaning, whereas ignorance of the underlying function hides the bias term from sight. Nevertheless, the MSE term stands a chance of being calculable so long as we have some data available. And while the bias and variance terms contain important information about how the sweet spot is attained, by revealing how the components of the total expected MSE trade off with one another, the trajectory of the MSE term alone suffices to inform where that sweet spot is. It is this observation that gives hope to the quest for the holy grail and makes the statistician realize that the cross-validated MSE can be an important proxy for the true MSE evaluated over many training and test sets.

Indeed, Figure 2 illustrates the conceptual link between the bias-variance tradeoff simulation and cross-validation. The left panel illustrates the setup for simulating the bias-variance tradeoff: \(n_{sim} = N\) different pairs of training and test sets are generated, and the statistical method learns each of the training sets before taking a corresponding test. The right panel illustrates the setup for cross-validation: the same single data set is folded into \(K\) parts, and each part “rotates” to serve as the test set once, with the remaining \(K-1\) folds serving as the training set. For the sake of conceptual simplicity, if we assume that \(N \approx K\) and that the training and test sets in the two panels are of comparable size, then the bias-variance tradeoff simulation and cross-validation take the same amount of “work”, and one can see a direct correspondence in the roles the blue and magenta areas play in the two settings. This mental model immediately sheds light on Equation (4). The reason we average the MSE over the \(K\) folds is that we want an estimate of the expectation of the MSE over many held-out sets, much as we average the simulated MSE over many Monte Carlo iterations, which itself estimates the true expected test MSE on unseen data.

The differences between cross-validation (CV) and the bias-variance tradeoff simulation are the following:

  • With CV, we cannot determine the bias component of a statistical learning method, since in practice we do not know the true underlying function \(f\);
  • With CV, we cannot determine the variance component of a statistical learning method, since in practice we often have only one set of data available and the \(K\) folds typically do not share the same \(X\) values, so variance cannot be properly defined in this setting;
  • With CV, we can still compute an average MSE over the folds. One issue, however, is that the MSEs on different folds are correlated. As can be seen in the right panel of Figure 2, when computing the MSE on a held-out fold, part of the data used to train the learning method is also used to train the learning methods that make predictions on the other folds! The MSE on each fold is still unbiased, so the overall estimated MSE given by Equation (4) is still unbiased; its variance, however, does not have the same properties as that of Equation (1). The repercussion is that when one uses cross-validation to tune a statistical learning method, one might not hit the true sweet spot, because the curve/surface of MSE versus hyperparameter(s) estimated by CV need not follow the curve/surface estimated by Monte Carlo simulation. (Actually, even if one did find the set of hyperparameters that optimizes the bias-variance tradeoff, one still might or might not achieve the best predictive performance on a random set of unseen test data. This is because the optimal set of hyperparameters minimizes the predictive MSE only on average, and that guarantee succumbs to the variation in individual test sets. We shall observe these points in the simulations.)

Figure 2: Conceptual Relationship between Bias-Variance Tradeoff Simulation and Cross-Validation

0.3 Simulation Studies

The conceptual underpinnings of this project have been covered in detail, which leads to a simple goal for the experimental components: we set out to verify the points stated in the previous section by experimenting with a selection of functions \(f\) as well as statistical learning methods of interest. Due to space constraints, we present only part of the experiment, omitting some results for certain functions \(f\) and for most error distributions other than the Normal. As a sanity check, we also verify numerically whether Equation (1) holds. Such numerical results are likewise suppressed in this report for tidiness. The interested reader can refer to the code for full details.

0.3.1 Smoothing Splines

We start with a univariate analysis using smoothing splines as the family of statistical learning methods. We use three functions of increasing flexibility:

\[ f_1(x) = 2.646429 + 0.06816071\,x + 0.0003169643\,x^2 \]
\[ f_2(x) = 4.665886 - 0.042422\,x + 0.003422104\,x^2 - 0.00002854773\,x^3 \]
\[ f_3(x) = 29.61689 - 1.389464\,x + 0.024976\,x^2 + 0.0002523139\,x^3 - 0.000008158508\,x^4 + 4.110106 \times 10^{-8}\,x^5 \]
We draw \(X\) from a univariate uniform distribution and the errors from a normal distribution as well as a variety of other distributions with the same variance. As mentioned earlier, the particular error distribution does not play a huge role, so to save space we show only the results from a Double Exponential distribution alongside those from the Normal distribution. Along with the bias-variance tradeoff curves, we overlay the cross-validated MSE curve obtained from a separate training set. Furthermore, we draw an independent test set, fit the learning method over the grid of hyperparameters, and record the corresponding test MSE. This is “bogus” in the sense that in a real scenario we would only use the optimal hyperparameter found by CV to train the method and predict on the test set; we would not be able to compute the test MSE, since the response values are typically unknown. Here, however, for pedagogical purposes, we can generate the test set by simulation, compute the “bogus” test MSE, and compare it to the CV MSE as well as the true MSE. We also get a sense of how close we would have come to the (bogus, but practical) optimum for that particular test set had we used the hyperparameter from CV.
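For reference, the three functions transcribe directly into code. The sketch below is in Python (the project's code is in R); the sampling range, sample size, and the symmetrized-exponential construction of the Double Exponential (Laplace) errors are illustrative assumptions, with the noise variance set to 2 to match the irreducible error shown in the plots.

```python
import random

# The three test functions, transcribed from the text
def f1(x):
    return 2.646429 + 0.06816071 * x + 0.0003169643 * x ** 2

def f2(x):
    return 4.665886 - 0.042422 * x + 0.003422104 * x ** 2 - 0.00002854773 * x ** 3

def f3(x):
    return (29.61689 - 1.389464 * x + 0.024976 * x ** 2
            + 0.0002523139 * x ** 3 - 0.000008158508 * x ** 4
            + 4.110106e-8 * x ** 5)

random.seed(3)

def make_data(f, n=300, lo=0.0, hi=100.0, sigma=2 ** 0.5, laplace=False):
    """Draw X uniformly and add Normal or variance-matched Laplace noise."""
    xs = [random.uniform(lo, hi) for _ in range(n)]
    if laplace:
        # Laplace(0, b) has variance 2*b^2; match sigma^2 by b = sigma / sqrt(2)
        b = sigma / 2 ** 0.5
        eps = [random.expovariate(1 / b) * random.choice((-1, 1)) for _ in range(n)]
    else:
        eps = [random.gauss(0, sigma) for _ in range(n)]
    return xs, [f(x) + e for x, e in zip(xs, eps)]
```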


Figure 3: Splines with Normal and with Double Exponential Errors

The left column of Figure 3 shows the results for Normal errors. The first row in each subplot shows the true function \(f\) and the fitted spline. The first two fitted splines are wigglier than the true functions because the signal-to-noise ratio is apparently low in the first two cases. The fitted function adheres more closely to the true function in the case of \(f_3\). It is interesting to note, however, that the wiggliness does not impair the fitted splines’ ability to make sound predictions, as we observe in the second row of each subplot, where the minimum true MSE attained (in red) is very close to the irreducible error (= 2). We note that as the degrees of freedom increase, the bias decreases and the variance increases, resulting in the characteristic Nike shape for the total expected true MSE. The black curve shows the cross-validated MSE; we see that for \(f_3\), the hyperparameter (DF for smoothing splines) chosen by CV coincides with the sweet spot in the bias-variance tradeoff, but for \(f_1\) and \(f_2\) it does not. Even when it does, the smoothing spline might not perform at its very best on one specific test set (purple curve), for reasons explained earlier. Overall, as the underlying function becomes more complex, cross-validation selects higher degrees of freedom for the smoothing spline.

One thing to point out is that both the cross-validated MSE and the MSE on one single test set took values lower than the irreducible error, whereas theory dictates that the expected true MSE cannot. How can we explain this?

For the cross-validated MSE: it is true that the CV folds do not share the same values of the predictors \(X\), so the bias and the variance are ill-defined and Equation (1) is inapplicable to the cross-validated MSE. It is nonetheless an unbiased estimator of the true MSE, since each fold produces an MSE on unseen data. As we alluded to earlier, its expectation over a large set of observations and many folds is unbiased, albeit with larger variance (induced by the positive correlation between the individual terms in Equation (4)). As such, if the expectation were estimated accurately, the cross-validated MSE should still hover above the line of the irreducible error. Here, the only reason it attained lower values is the small scale on which the simulations were run: there were only hundreds of observations, and with ten-fold cross-validation (as we typically do), only tens of observations are held out in each fold. The smaller effective sample size, coupled with the larger variance, is the likely culprit for the simulation error, creating the deceptive illusion that the expected cross-validated MSE can dip below the irreducible error. If we increase the sample size, we should eventually see the line of the irreducible error bound the cross-validated MSE from below.

For the MSE on one single test set, it is a different story. Note that the \(MSE\) in Equation (1) is a random variable, and as such it has its own variance, which the same equation does not address. All the equation guarantees is that over many replications of the experiment, the MSE must on average be no less than the irreducible error. An individual experiment, however, is not subject to the law of large numbers and can be higher or lower than the expectation.

The right column of Figure 3 shows the results for Double Exponential errors. Similar behaviors are observed in the MSE curves. The same can be said for a variety of other error distributions.

0.3.2 Penalized Regression

From here on we cruise into multivariate terrain. The predictors (45 of them) are now drawn from a multivariate uniform distribution, and the true functions take a vector \(X\) as input and produce a univariate \(Y\) as output. For simplicity, we draw errors only from a Normal distribution. We entertain four different functions whose details are less important than their qualitative features, described below (the interested reader can refer to the code for details).

  • \(f_1\): Purely linear with all predictors contributing equally to the response.
  • \(f_2\): Purely linear but with only 2 contributing (equally) to the response.
  • \(f_3\): Only 2 predictors contribute linearly, and some predictors contribute quadratically to the response.
  • \(f_4\): Only 2 predictors contribute linearly, and some predictors contribute both quadratically and cubically to the response.

Figure 4: Penalized Regression (Ridge and Lasso)

Figure 4 shows the results of applying penalized regression methods to the data. The left column shows the curves for ridge regression and the right column the curves for the lasso, both as functions of \(\lambda\), the shrinkage parameter. For \(f_1\), ridge and lasso agree on no shrinkage, which reduces to linear regression. This is expected, since the true function is purely linear in all predictors. For \(f_2\), the lasso decides to shrink the estimator, as it gets some sense that the true function is sparse in its contributing predictors. Ridge, on the other hand, still prefers no shrinkage; in ridge regression, coefficient estimates are never shrunk exactly to zero, so it cannot perform variable selection. For \(f_3\), similar choices are made by the ridge and lasso methods, likely because the quadratic contributions are not large enough to influence the fit. For \(f_4\), however, both methods realize that a purely linear model is inadequate to capture the underlying function while incurring too much variance, so they shrink the estimates at the cost of more bias. Overall, as the function becomes more nonlinear, both the ridge and the lasso do worse in terms of MSE, since they are, at heart, linear models.

It is noteworthy that while the bias does increase with larger \(\lambda\) values, in some cases the variance also increases along with the bias. This can be surprising, but it is only a local behavior. As \(\lambda \to \infty\), all coefficient estimates are shrunk to zero, and so is the variance. The tradeoff between the bias and the variance does not have to be a “zero-sum” game: the two generally trade off, but sometimes both can lose (or win). When we do observe the tradeoff, the characteristic Nike shape appears and a non-zero \(\lambda\) value indicates the need for shrinkage. When shrinkage is not needed, the variance in our simulations increases a bit along with the bias for \(\lambda > 0\), which “locks in” the optimal \(\lambda\) value of 0. This is another way to think about the behaviors observed in these plots.
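The limiting behavior as \(\lambda \to \infty\) is easy to see in a one-predictor ridge regression through the origin, where the estimate has the closed form \(\hat\beta_\lambda = \sum x_i y_i / (\sum x_i^2 + \lambda)\). The Python sketch below is illustrative only (the project fits the full 45-predictor models in R); the slope of 3 and the noise level are made up.

```python
import random

random.seed(4)
# Hypothetical data: y = 3x + noise
xs = [random.uniform(-1, 1) for _ in range(100)]
ys = [3.0 * x + random.gauss(0, 0.5) for x in xs]

def ridge_beta(lam):
    """Closed-form one-predictor ridge estimate through the origin."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

betas = [ridge_beta(lam) for lam in (0.0, 1.0, 10.0, 1e6)]
# lam = 0 recovers OLS; as lam grows, the coefficient (and with it the
# variance of the fit) shrinks toward zero while the bias grows.
```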

0.3.3 Tree-Based Methods (Boosting)

There are two important families of tree-based methods: random forests, of which bagging is a special case, and boosting. Upon close study of these methods, we quickly realize that bagging and random forests do not have tuning parameters, unless we count the choice of predictor subset size \(m\), which is typically not tuned. Boosting, on the other hand, has several tuning parameters: the number of trees \(B\), the shrinkage parameter \(\lambda\), and the number \(d\) of splits in each tree. Of these, \(\lambda\), which controls the speed at which learning occurs, is typically set empirically and, for computational speed in this project, is set to 0.1, while \(d\) is set to 4. Thus we are left with just one hyperparameter, \(B\). To further speed up the computations, we use 10 predictors instead of 45.
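To make the mechanics of the \(B\) and \(\lambda\) knobs concrete, here is a hedged Python sketch of boosting for regression with depth-1 trees (stumps) rather than the \(d = 4\) trees used in the project; the one-predictor data-generating function and all constants are illustrative, not the project's setup.

```python
import random
import statistics

random.seed(5)
f = lambda x: x ** 2                      # hypothetical true function
xs = [random.uniform(-2, 2) for _ in range(120)]
ys = [f(x) + random.gauss(0, 0.3) for x in xs]

def fit_stump(xs, rs):
    """Best single-split tree on the residuals: threshold + left/right means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost(xs, ys, B, lam=0.1):
    """B rounds of fitting a stump to the current residuals, shrunk by lam."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(B):
        rs = [y - p for y, p in zip(ys, pred)]    # current residuals
        s = fit_stump(xs, rs)
        stumps.append(s)
        pred = [p + lam * s(x) for p, x in zip(pred, xs)]
    return lambda x: lam * sum(s(x) for s in stumps)

mse = lambda model: statistics.fmean((y - model(x)) ** 2 for x, y in zip(xs, ys))
mse_small, mse_big = mse(boost(xs, ys, 10)), mse(boost(xs, ys, 150))
```

The training MSE drops steeply at first and then flattens, mirroring the “hinge” shape discussed below.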


Figure 5: Boosting

Figure 5 shows the bias-variance tradeoff curves for boosting for each of the same functions as before (up to the number of predictors). A hallmark of these curves is that the MSE exhibits not so much the characteristic “Nike” shape observed for the previous statistical learning methods as a “hinge” shape. The significance of this is that even though the minimum MSE may be achieved at a high number of trees, virtually all of the benefit of having a multitude of trees is reaped with a moderate number of trees. This behavior is observed for all four functions. The statistician should therefore be mindful when selecting the optimal hyperparameter by cross-validation in the case of boosting: a lower number of trees that achieves essentially the same predictive MSE while saving a great deal of computation is generally preferable to a higher number of trees that consumes far more computation with diminishing returns. Moreover, CV might not even estimate the absolute sweet spot of the bias-variance tradeoff very accurately, owing to the flat bottom the curve reaches after the drastic initial decrease in MSE. On the other hand, as can be seen in the plots, CV does well if instead used to estimate the threshold of the hyperparameter \(B\) beyond which most of the decrease in MSE is achieved, thanks to how closely the cross-validated MSE adheres to the true expected MSE in the bias-variance tradeoff. The moral of the story is to use just enough trees and no more.

Overall, in contrast to the penalized regression methods, which suffer from non-linearity, boosting suffers from complexity: the largest MSE results from \(f_1\), the function with all 10 predictors present.

0.3.4 Support Vector Machine (SVM)

The book treats the SVM only in the classification setting, but the same family of methods can be applied in the regression setting as well, as we do here. Compared to the penalized regression and tree-based methods, a major difference in tuning the SVM lies in the number of hyperparameters (tuning knobs) accessible to the statistician. In our case, we use the SVM with the radial kernel and two hyperparameters: \(cost\) and \(gamma\). A side effect is that we also explore a bit of three-dimensional data visualization for the bias-variance tradeoff surface. To speed up the computations, we use only 5 predictors, but the same ideas apply for any number of predictors.

The interactive plots above display the results from the SVM for the same four functions (up to the number of predictors). The color coding is the same as in the previous two-dimensional bias-variance tradeoff curves. As we can see, the bias dominates the total expected MSE. The cross-validated \(gamma\) agrees with the true optimal \(gamma\), whereas the cross-validated \(cost\) deviates slightly from the true optimal \(cost\). Interestingly, the hyperparameters chosen by CV agree perfectly with those minimizing the (“bogus”) MSE on the specific test set. So in hindsight, it wasn’t all that bad! This is due to the fact that the MSE surface exhibits a flat groove along the \(cost\) dimension.
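For reference, the radial kernel underlying these fits is \(K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\); of the two knobs, \(gamma\) enters the kernel while \(cost\) enters the optimizer. A small Python sketch (the points and \(gamma\) values are illustrative, not from the project):

```python
import math

def rbf_kernel(x1, x2, gamma):
    """Radial (RBF) kernel: similarity decays with squared distance."""
    sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq)

x, z = [0.0] * 5, [1.0] * 5               # two points in 5 predictors
k_small = rbf_kernel(x, z, gamma=0.1)     # wide kernel: distant points still similar
k_large = rbf_kernel(x, z, gamma=10.0)    # narrow kernel: similarity dies off fast
# Larger gamma -> more local, more flexible fits (lower bias, higher variance).
```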

Overall, similar to boosting and in contrast to the penalized regression methods, which suffer from non-linearity, the SVM also suffers from complexity: the largest MSE results from \(f_1\), the function with all 5 predictors present.

0.4 Conclusion and Future Work

Through reading the ISLR book with this project on the bias-variance tradeoff as a guide, we have studied the inner workings of a wide array of statistical learning methods, including many not selected for the project. Many of these methods now take center stage in solving challenging and interesting problems. We then treated these methods as black boxes and simulated their bias-variance tradeoff curves/surfaces. That the bias-variance tradeoff can serve as a unifying viewpoint from which a wide gamut of statistical learning methods are understood is satisfying. It is through carrying out the simulations in R that a deeper understanding of this central thesis is gained. The intuition gained can be helpful to a practitioner who routinely performs cross-validation in an attempt to tune a statistical learning method to the data. A side task of the project is data visualization, which has allowed for an intuitive understanding of the interplay between the various quantities.

Future work can focus on exploring a wider selection of data settings as well as other important statistical learning methods. Cross-validation can also be applied to many exciting real data problems, and further appreciation of the underlying bias-variance tradeoffs can be gained through such hands-on experiences.

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. 1st ed. New York, NY: Springer. https://www.statlearning.com/.
Kraft, Robin. 2019. “Simulating the Bias-Variance Tradeoff in R.” https://www.r-bloggers.com/2019/06/simulating-the-bias-variance-tradeoff-in-r/.